AI Engineering by Chip Huyen
- These notes are based entirely on AI Engineering by Chip Huyen
Chapter 1 Introduction to Building AI Applications with Foundation Models
Language Models (LMs)
Masked Language Models (MLM)
- Predict missing tokens in a sentence
- Example: "My favorite __ is blue"
- Use bidirectional context (both previous and next tokens)
- Common use cases:
- Sentiment analysis
- Text classification
- Code debugging (requires full contextual understanding)
- Example: BERT (Bidirectional Encoder Representations from Transformers)
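A quick way to see masked-token prediction in action is the Hugging Face transformers fill-mask pipeline. This is a minimal sketch (not from the book), assuming the transformers library and the bert-base-uncased checkpoint are available:
```python
# Minimal sketch: masked language modeling with a pretrained BERT via the
# Hugging Face fill-mask pipeline (assumes `transformers` is installed).
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT uses context on both sides of the [MASK] token to fill it in.
for prediction in fill_mask("My favorite [MASK] is blue."):
    print(prediction["token_str"], round(prediction["score"], 3))
```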
Generative Language Models
- A language model can generate an effectively infinite number of outputs from a finite vocabulary
- Models capable of producing open-ended outputs are called generative models
- This forms the basis of Generative AI
Learning Paradigms
Self-Supervised Learning
- Labels are automatically derived from the input data itself
Unsupervised Learning
- No explicit labels are required
When to Outsource Model Building
Build In-House When
- The model is critical to business differentiation
- There is a risk of exposing intellectual property to competitors
Outsource When
- It improves productivity and profitability
- It provides better performance and multiple vendor options
Role of AI in Products
Critical vs Complementary
- Complementary:
The application can function without AI (e.g., Gmail)
- Critical:
The application cannot function without AI (e.g., face recognition)
Requires high robustness and reliability
Reactive vs Proactive
- Reactive:
Responds to user inputs (e.g., chat-based responses)
- Proactive:
Takes initiative (e.g., traffic alerts in navigation apps)
Static vs Dynamic
- Static:
Features are updated only during application upgrades
- Dynamic:
Features evolve continuously based on user feedback
Automation in Products (Microsoft Framework)
- Crawl:
Human involvement is required
- Walk:
AI assists and interacts with internal employees
- Run:
AI directly interacts with end users
Achieving 0–60% automation is relatively easy, but moving from 60–100% automation is extremely challenging.
Three Layers of the AI Stack
Application Development Layer
- AI Interface
- Prompt Engineering
- Context Construction
- Evaluation
Model Development Layer
- Inference Optimization
- Dataset Engineering
- Modeling and Training
- Evaluation
Infrastructure Layer
- Compute Management
- Data Management
- Model Serving
- Monitoring
AI Engineering vs ML Engineering
AI Engineering
- Uses pretrained models
- Works with large-scale models
- Requires higher compute resources
- Workflow:
Product → Data → Model
ML Engineering
- Trains models from scratch
- Requires fewer resources
- Workflow:
Data → Model → Product
Adapting Pretrained Models
- Prompt Engineering
- Fine-tuning
- Training a pretrained model on a new task not seen during pretraining
Training Stages
Pretraining
- Resource-intensive (large data and compute)
- Model weights initialized randomly
- Trained for general text completion
Fine-tuning
- Requires significantly less data and compute
- Adapts the model to task-specific objectives
Post-training
- Often used interchangeably with fine-tuning
- Model developers:
Perform post-training before releasing the model (e.g., instruction-following)
- Application developers:
Fine-tune released models for specific downstream tasks
Chapter 2 Understanding Foundation Models
Model Performance Fundamentals
- Model performance depends on both:
- Training process
- Sampling (decoding) strategy
- An AI model is only as good as the data it is trained on
- The “use what we have, not what we want” dataset mindset often leads to:
- Strong performance on training data
- Weak performance on real-world tasks
- Small, high-quality datasets can outperform large, low-quality datasets
RNNs vs Transformers
- Recurrent Neural Networks (RNNs)
- Compress everything seen so far into a single hidden state, like generating text from a compressed summary of a book
- Struggle with long-range dependencies
- Transformers
- Attend directly to many tokens at once
- Comparable to generating text using entire pages of a book
- Enable long-context modeling via attention
Transformer Architecture
Transformer Block
Each transformer block consists of:
- Attention module
- MLP (feedforward network)
Transformer-Based Language Model
A typical transformer-based LM includes:
- Embedding Module (pre-transformer)
- Converts tokens into embedding vectors
- Transformer Blocks
- Model Head (post-transformer)
- Converts hidden states into token probabilities
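As a rough illustration of this structure, here is a simplified PyTorch sketch (my own, not from the book) with an embedding module, a stack of attention + MLP blocks, and a head that maps hidden states to vocabulary logits; positional encodings and causal masking are omitted for brevity:
```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """One block: attention module + MLP (feedforward), with residual connections."""
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)
        return self.norm2(x + self.mlp(x))

class TinyLM(nn.Module):
    def __init__(self, vocab_size=1000, d_model=256, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)   # embedding module (pre-transformer)
        self.blocks = nn.ModuleList([TransformerBlock(d_model) for _ in range(n_layers)])
        self.head = nn.Linear(d_model, vocab_size)       # model head (post-transformer)

    def forward(self, token_ids):
        x = self.embed(token_ids)
        for block in self.blocks:
            x = block(x)
        return self.head(x)  # logits; softmax over the last dim gives token probabilities

logits = TinyLM()(torch.randint(0, 1000, (1, 8)))
print(logits.shape)  # torch.Size([1, 8, 1000])
```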
Alternatives to Transformers
- RWKV
- RNN-based architecture
- No fixed context-length limitation
- Performance on extremely long contexts is not guaranteed
- State Space Models (SSMs)
- Designed for long-range memory
- Promising alternative to attention
- Examples:
- Mamba
- Jamba (Hybrid of Transformer and Mamba)
Training Tokens and Scaling Laws
\text{Total Training Tokens} = \text{Epochs} \times \text{Tokens per epoch}
Chinchilla Scaling Law
- Optimal training tokens ≈ 20× model parameters
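As a worked example of this rule of thumb (illustrative figures, not from the book):
```python
# Chinchilla rule of thumb: ~20 training tokens per model parameter.
def chinchilla_optimal_tokens(n_params: float) -> float:
    return 20 * n_params

for n_params in (3e9, 70e9):  # hypothetical 3B and 70B parameter models
    tokens = chinchilla_optimal_tokens(n_params)
    print(f"{n_params / 1e9:.0f}B params -> ~{tokens / 1e12:.2f}T training tokens")
```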
Research from Microsoft and OpenAI suggests:
- Hyperparameters can be transferred from a 40M model to a 6.7B model
Pre-training and Post-training
- Pre-training
- Equivalent to reading to acquire knowledge
- Resource-intensive
- Produces general-purpose representations
- Post-training
- Equivalent to learning how to use knowledge
- Includes instruction tuning, alignment, and RLHF
- Training only on high-quality data without pre-training is possible, but pre-training followed by post-training yields superior results
Reinforcement Learning from Human Feedback (RLHF)
RLHF is a two-stage process for aligning language models with human preferences:
- Train a reward model that scores the foundation model’s outputs
- Optimize the foundation model to generate responses that maximize the reward model’s score
Preference Data Structure
Instead of using scalar scores (which vary across individuals), RLHF uses comparative preferences:
- Each training example consists of: (prompt, winning_response, losing_response)
- This pairwise comparison approach captures relative quality more reliably than absolute ratings
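A preference example might be stored as below; the field names and contents are illustrative, not a fixed standard:
```python
# One hypothetical preference pair for reward-model training.
preference_example = {
    "prompt": "Explain overfitting in one sentence.",
    "winning_response": "Overfitting is when a model memorizes its training data "
                        "and fails to generalize to new examples.",
    "losing_response": "Overfitting is when the training loss goes down.",
}
```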
Reward Model Training
Objective
Maximize the score difference between winning and losing responses for each prompt.
Mathematical Formulation
Notation:
- r: reward model
- x: prompt
- y_w: winning response
- y_l: losing response
- s_w = r(x, y_w): reward score for winning response
- s_l = r(x, y_l): reward score for losing response
- \sigma: sigmoid function
Loss Function:
\mathcal{L} = -\log(\sigma(s_w - s_l))
where \sigma(z) = \frac{1}{1 + e^{-z}}
Goal: Minimize this loss function across all preference pairs
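A minimal sketch of this loss (PyTorch, illustrative; it assumes the scores have already been produced by some reward model r):
```python
import torch
import torch.nn.functional as F

def reward_model_loss(s_w: torch.Tensor, s_l: torch.Tensor) -> torch.Tensor:
    """-log(sigmoid(s_w - s_l)), averaged over a batch of preference pairs."""
    return -F.logsigmoid(s_w - s_l).mean()

# Toy reward scores for three preference pairs (hypothetical values).
s_w = torch.tensor([1.2, 0.3, 2.0])   # scores for winning responses
s_l = torch.tensor([0.5, 0.4, -1.0])  # scores for losing responses
print(reward_model_loss(s_w, s_l))    # loss shrinks as the winner's score pulls ahead
```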
Model Architecture Choices
The reward model can be:
- Trained from scratch on preference data
- Fine-tuned from a foundation model (often preferred, as the reward model should ideally match the capability level of the model being optimized)
- A smaller model (can also work effectively in practice, offering computational efficiency)
Key Insights
- Pairwise preferences are more consistent and reliable than absolute ratings
- The sigmoid in the loss function ensures the model learns relative differences in quality
- The reward model’s capacity should generally align with the foundation model it’s evaluating
Sampling Strategies and Model Hallucination
Temperature-Based Sampling
Core Formula
P(token) = \text{softmax}\left(\frac{\text{logits}}{\text{temperature}}\right)
Temperature Effects
- Higher temperature reduces the probability of common tokens, thereby increasing the probability of rare tokens
- This enables more creative and diverse responses
- Standard value: Temperature of 0.7 balances creativity with coherent generation
To check whether a model is learning, examine its output probability distributions: if they remain essentially random throughout training, the model is not learning effectively.
Numerical Stability
To avoid underflow issues (since probabilities can be extremely small), we use log probabilities in practice.
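A small NumPy sketch (mine, not the book's) showing temperature scaling with the usual max-subtraction trick for numerical stability:
```python
import numpy as np

def temperature_softmax(logits, temperature=0.7):
    z = np.asarray(logits, dtype=np.float64) / temperature
    z -= z.max()                       # subtract the max for numerical stability
    probs = np.exp(z)
    return probs / probs.sum()

logits = [2.0, 1.0, 0.1]               # hypothetical logits for three tokens
for t in (0.5, 0.7, 1.5):
    p = temperature_softmax(logits, t)
    # Higher temperature flattens the distribution; log probs avoid underflow downstream.
    print(t, p.round(3), np.log(p).round(3))
```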
Advanced Sampling Strategies
Top-k Sampling
- Select the k largest logits
- Apply softmax only to these k tokens to compute probabilities
- Benefit: Reduces computational cost and filters out unlikely tokens
Top-p (Nucleus) Sampling
- Find the smallest set of tokens whose cumulative probabilities sum to p
- Process: Sort tokens by probability (descending order) and keep adding until cumulative probability reaches p
- More adaptive than top-k as the number of considered tokens varies
Min-p Sampling
- Set a minimum probability threshold for tokens to be considered
- Any token with probability below this threshold is excluded from sampling
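The three truncation strategies above can be sketched as filters over a probability vector (illustrative NumPy code, with each result renormalized over the kept tokens):
```python
import numpy as np

def top_k(probs, k=3):
    keep = np.argsort(probs)[-k:]                  # indices of the k most likely tokens
    out = np.zeros_like(probs)
    out[keep] = probs[keep]
    return out / out.sum()

def top_p(probs, p=0.9):
    order = np.argsort(probs)[::-1]                # sort descending
    cutoff = np.searchsorted(np.cumsum(probs[order]), p) + 1
    keep = order[:cutoff]                          # smallest set whose cumulative prob reaches p
    out = np.zeros_like(probs)
    out[keep] = probs[keep]
    return out / out.sum()

def min_p(probs, threshold=0.05):
    out = np.where(probs >= threshold, probs, 0.0) # drop tokens below the probability floor
    return out / out.sum()

probs = np.array([0.45, 0.25, 0.15, 0.10, 0.05])
print(top_k(probs), top_p(probs), min_p(probs), sep="\n")
```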
Best-of-N Sampling
- Generate multiple responses for the same prompt
- Select the response with the highest average log probability (sketched below)
- Improves output quality at the cost of increased compute
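A hedged sketch of best-of-N selection; sample_with_logprobs is a hypothetical stand-in for a sampler that returns a response plus its per-token log probabilities:
```python
import numpy as np

def sample_with_logprobs(prompt: str):
    """Hypothetical stub: a real implementation would call the model's sampler."""
    text = f"candidate answer to: {prompt}"
    return text, np.random.uniform(-3.0, -0.1, size=10)  # fake per-token log probs

def best_of_n(prompt: str, n: int = 4) -> str:
    candidates = [sample_with_logprobs(prompt) for _ in range(n)]
    scores = [logprobs.mean() for _, logprobs in candidates]  # average log probability
    return candidates[int(np.argmax(scores))][0]              # keep the best-scoring response

print(best_of_n("Summarize Chapter 2 in one sentence."))
```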
Handling Structured Outputs
1. Prompting Techniques
- Better prompt engineering with clear instructions
- Two-query approach:
- First query generates the response
- Second query validates the response format
2. Post-Processing
- Identify repeated common mistakes in model outputs
- Write scripts to correct these systematic errors
- Works well when the model produces mostly correct formats with predictable issues
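For instance, if a model reliably emits JSON with trailing commas, a small post-processing script can repair it (illustrative example, not from the book):
```python
import json
import re

def fix_trailing_commas(text: str) -> str:
    """Remove a comma that appears right before a closing bracket or brace."""
    return re.sub(r",\s*([}\]])", r"\1", text)

raw_output = '{"name": "Chip", "topics": ["LLMs", "MLOps",],}'  # hypothetical model output
print(json.loads(fix_trailing_commas(raw_output)))
```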
3. Constrained Sampling
- Sample only from a selected set of valid tokens
- Example: In JSON generation, prevent invalid syntax such as "{{" with no intervening key
- Enforces grammatical/syntactic correctness at the token level
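Conceptually, constrained sampling masks out tokens the grammar does not allow at the current step; this toy sketch (illustrative, with made-up token ids) shows the masking-and-renormalizing step:
```python
import numpy as np

def constrain(probs, valid_token_ids):
    out = np.zeros_like(probs)
    out[valid_token_ids] = probs[valid_token_ids]  # keep only grammar-valid tokens
    return out / out.sum()                         # renormalize before sampling

probs = np.array([0.4, 0.3, 0.2, 0.1])
# Suppose the JSON grammar only allows tokens 1 and 3 at this step (hypothetical ids).
print(constrain(probs, valid_token_ids=[1, 3]))
```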
4. Test-Time Compute
- Generate multiple candidate responses
- Use a selection mechanism to output the best one
- Trade compute for quality
5. Fine-Tuning
Fine-tuning is the most effective approach for structured outputs:
- Full model fine-tuning (better if resources available)
- Partial fine-tuning (LoRA, adapters, etc.)
- Directly teaches the model the desired output format
Model Hallucination
Definition
Hallucination occurs when a model generates responses that are not grounded in factual information.
Consistency Issues
Due to the probabilistic nature of language models:
- Same prompt can produce multiple different outputs
- This inconsistency can be problematic for production systems
Mitigation Strategies
- Caching: Store responses for repeated queries to ensure consistency
- Sampling adjustments: Modify temperature, top-k, or top-p parameters
- Deterministic seeds: Use fixed random seeds
Even with these techniques, 100% consistency is not guaranteed. Hardware differences in floating-point computation can lead to variations across different systems.
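A minimal caching sketch for repeated queries; call_model is a hypothetical stand-in for whatever inference API is actually used:
```python
from functools import lru_cache

def call_model(prompt: str, temperature: float = 0.0) -> str:
    """Hypothetical stub standing in for a real inference call."""
    return f"(response to: {prompt})"

@lru_cache(maxsize=1024)
def cached_answer(prompt: str) -> str:
    # Fixing temperature at 0 and caching keeps repeated queries consistent,
    # though hardware-level floating-point differences can still cause drift.
    return call_model(prompt, temperature=0.0)

print(cached_answer("Who is Chip Huyen?"))
print(cached_answer("Who is Chip Huyen?"))  # served from the cache, identical output
```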
Why Models Hallucinate
Hypothesis 1: Self-Supervision is the Culprit
Problem: Models cannot differentiate between:
- Given data (the prompt)
- Generated data (their own outputs)
Example Scenario
- Prompt: “Who is Chip Huyen?”
- Model generates: “Chip Huyen is an architect”
- Next token generation: The model treats “Chip Huyen is an architect” as a fact, just like the original prompt
- Consequence: If the initial generation is incorrect, the model will continue to justify and build upon the incorrect information
Mitigation Techniques
From an RL perspective, train the model to differentiate between:
- Observations about the world (given text/prompt)
- Model’s actions (generated text)
This helps the model maintain awareness of what is given vs. what it produces.
Include both factual and counterfactual examples in training data, explicitly teaching the model to recognize and avoid false information.
Hypothesis 2: Supervision is the Culprit
Problem: Conflict between the model’s internal knowledge and labeler knowledge during SFT (Supervised Fine-Tuning).
The Issue
- During SFT, models are trained on responses written by human labelers
- Labelers use their own knowledge to write responses
- The model may not have access to the same knowledge base
- Result: The model learns to produce responses it cannot properly ground, leading to hallucination
Better Approach
When creating training data:
- Document the information sources used to arrive at the response
- Include reasoning steps that show how the conclusion was reached
- This allows the model to understand not just what to say, but why and based on what information